This project analyses Boulder B-cycle data to understand and document usage patterns from 2013 to early 2016. The analysis relies on summaries and visualizations. The project also applies Machine Learning to the numerical data to try to make predictions.
The analysis is divided into three main sections:
Summary of Data
Data Visualizations
Machine Learning to predict the Pass Type
##   Rider.Home.System Rider.or.Operator.Number Entry.Pass.Type Bike.Number
## 1   Boulder B-cycle                 R1011535         24-hour         548
## 2   Boulder B-cycle                 R1011722         24-hour         742
## 3   Boulder B-cycle                 R1008367          Annual         578
## 4   Boulder B-cycle                 R1010650         24-hour         616
## 5   Boulder B-cycle                 R1008367          Annual         578
## 6   Boulder B-cycle                 R1055681          Annual         601
##   Checkout.Date Checkout.Day.of.Week Checkout.Time  Checkout.Station
## 1     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 2     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 3     5/20/2011               Friday    9:33:00 AM Broadway & Alpine
## 4     5/20/2011               Friday    9:34:00 AM Broadway & Alpine
## 5     5/20/2011               Friday    9:36:00 AM Broadway & Alpine
## 6     5/20/2011               Friday    9:39:00 AM UCAR Center Green
##   Return.Date Return.Day.of.Week Return.Time    Return.Station
## 1   5/20/2011             Friday  9:40:00 AM      26th @ Pearl
## 2   5/20/2011             Friday  9:54:00 AM      15th & Pearl
## 3   5/20/2011             Friday  9:36:00 AM Broadway & Alpine
## 4   5/20/2011             Friday  9:37:00 AM Broadway & Alpine
## 5   5/20/2011             Friday  9:39:00 AM Broadway & Alpine
## 6   5/20/2011             Friday  9:42:00 AM UCAR Center Green
##   Trip.Duration..Minutes.
## 1                      16
## 2                      30
## 3                       3
## 4                       3
## 5                       3
## 6                       3
##                Rider.Home.System  Rider.or.Operator.Number
##  Boulder B-cycle        :243333   M9999957:  9499
##  Denver B-cycle         :  4666   M9999950:  5684
##  Madison B-cycle        :   201   M9999952:  5538
##  Houston B-cycle        :   113   R1028713:  4006
##  Indy - Pacers Bikeshare:    74   M9999943:  3077
##  GREENbike              :    38   M9999998:  2835
##  (Other)                :   119   (Other) :217905
##            Entry.Pass.Type    Bike.Number       Checkout.Date
##  24-hour           : 83642   411    :  1821   6/25/2015:   703
##  7-day             :  5585   584    :  1755   8/2/2015 :   650
##  Annual            :113041   666    :  1613   8/8/2015 :   639
##  Maintenance       : 37337   744    :  1608   7/28/2015:   635
##  Semester (150-day):  8939   665    :  1607   6/26/2015:   621
##                              699    :  1596   8/5/2015 :   621
##                              (Other):238544   (Other)  :244675
##  Checkout.Day.of.Week     Checkout.Time     Checkout.Station
##  Friday   :39020       12:16:00 PM:   467   Length:248544
##  Monday   :35182       12:26:00 PM:   455   Class :character
##  Saturday :36603       12:45:00 PM:   447   Mode  :character
##  Sunday   :28767       4:12:00 PM :   434
##  Thursday :38079       5:05:00 PM :   433
##  Tuesday  :34903       12:12:00 PM:   432
##  Wednesday:35990       (Other)    :245876
##     Return.Date      Return.Day.of.Week      Return.Time
##  6/25/2015:   706   Friday   :39026      12:04:00 AM:   495
##  8/2/2015 :   651   Thursday :37881      1:13:00 PM :   451
##  8/8/2015 :   637   Saturday :36322      12:12:00 PM:   441
##  7/28/2015:   629   Wednesday:36042      1:51:00 PM :   439
##  6/26/2015:   624   Monday   :35362      12:15:00 PM:   437
##  7/11/2015:   624   Tuesday  :34944      12:52:00 PM:   436
##  (Other)  :244673   (Other)  :28967      (Other)    :245845
##  Return.Station     Trip.Duration..Minutes.
##  Length:248544      Min.   :    -2.00
##  Class :character   1st Qu.:     5.00
##  Mode  :character   Median :    12.00
##                     Mean   :    63.36
##                     3rd Qu.:    26.00
##                     Max.   :181607.00
##
There are 248544 observations of 13 variables. The variables include Checkout/Return Station, Checkout/Return Time, Type of Pass, Day of the Week, Trip Duration, Bike Number and Rider/Operator Number. Also included is a location dataset with latitude and longitude, along with other information about the Checkout/Return stations.
There are some errors in the “Rider.Home.System” column. This is supposed to be data for Boulder, but for some rows the home system was set to other systems such as Denver and Houston, which is not correct. This is not a big deal, though, because the column should be a constant and is therefore probably not important to the analysis.
NOTE: I also corrected the Checkout/Return Station “RTD”, which is really “14th & Canyon” but was entered incorrectly as RTD. I found this error later in the analysis, but the correction is applied early in the workflow.
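The station-name correction described in the note could be sketched as follows. This is a minimal illustration, not the original cleaning code: the `trips` data frame is a hypothetical toy stand-in, and `fix_station` is a helper name introduced here. It assumes the station columns are character vectors, as the summary output suggests.

```r
# Toy stand-in for the trip data; only the two station columns matter here.
trips <- data.frame(
  Checkout.Station = c("RTD", "15th & Pearl", "RTD"),
  Return.Station   = c("15th & Pearl", "RTD", "Broadway & Alpine"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the mislabelled station name with the real one.
fix_station <- function(x) {
  x[x == "RTD"] <- "14th & Canyon"
  x
}

trips$Checkout.Station <- fix_station(trips$Checkout.Station)
trips$Return.Station   <- fix_station(trips$Return.Station)
```

Applying the same helper to both columns keeps the checkout and return station names consistent with each other.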
This section consists largely of visualizations. It combines univariate and multivariate plots, focusing on one variable at a time.
Fig1: Rider/Operator Count
Fig2: Rider/Operator Count separated by Pass Type
NOTE: Most riders have between 1 and 200 rides, so to make the patterns easier to see, the data was subset to riders with 200 or more rides.
The following trends can be noted in this subset.
Some riders use B-cycle very heavily. Faceting by pass type gives a better understanding of which passes they prefer. The Annual pass is the biggest winner among frequent riders (not surprising), but one rider took a little more than 200 rides using the 24-hour pass (surprising, given that cheaper options were available).
The number of rides on the Maintenance pass is very interesting: a few users account for a lot of rides. This indicates that these were operators who regularly used and fixed the bikes.
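The 200-ride subset mentioned in the note above could be built along these lines. This is a sketch under assumptions: `trips` is a hypothetical toy data frame, and the threshold is taken as "200 or more" as stated in the text.

```r
# Toy rider IDs: R1 has 250 rides, R2 has 3, M1 has exactly 200.
trips <- data.frame(
  Rider.or.Operator.Number = c(rep("R1", 250), rep("R2", 3), rep("M1", 200)),
  stringsAsFactors = FALSE
)

# Count rides per rider, then keep only trips by riders at or above the threshold.
ride_counts  <- table(trips$Rider.or.Operator.Number)
frequent_ids <- names(ride_counts)[ride_counts >= 200]
frequent     <- trips[trips$Rider.or.Operator.Number %in% frequent_ids, , drop = FALSE]
```

Filtering on rider IDs rather than on the count table keeps every individual trip row, so the subset can still be faceted by pass type.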
Fig3: Pass Type Count
Fig4: Pass Type Count separated by Day of the Week
There are four rider pass types, plus a Maintenance type, as noted from the plot above. The Annual pass is clearly the most popular, followed by the 24-hour pass. The Semester (150-day) and 7-day passes pale in comparison. Maintenance also has relatively high use compared to the 7-day and Semester (150-day) types.
One thing that stands out in Fig4 is that the 24-hour pass is used far more than the Annual pass on weekends, whereas on weekdays the Annual pass is still the most widely used. Semester and 7-day pass usage is very low throughout.
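A weekday/weekend comparison like the one above can also be checked numerically with a cross-tabulation. The sketch below uses a hypothetical toy `trips` data frame and an `Is.Weekend` column introduced here for illustration; it is not the code behind Fig4.

```r
# Toy data: pass type and checkout day for six trips.
trips <- data.frame(
  Entry.Pass.Type      = c("Annual", "Annual", "24-hour", "24-hour", "24-hour", "Annual"),
  Checkout.Day.of.Week = c("Monday", "Tuesday", "Saturday", "Sunday", "Friday", "Saturday"),
  stringsAsFactors = FALSE
)

# Flag weekend checkouts, then count trips per pass type for weekdays vs weekends.
trips$Is.Weekend <- trips$Checkout.Day.of.Week %in% c("Saturday", "Sunday")
counts <- table(trips$Entry.Pass.Type, trips$Is.Weekend)

# Column-wise shares: within weekdays (FALSE) or weekends (TRUE), which pass dominates?
shares <- prop.table(counts, margin = 2)
```

Looking at shares within each column avoids comparing raw counts across five weekdays and two weekend days.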
Fig5: Bike# Count separated by Pass Type
Fig6: Bike# Count separated by Day of the Week
Fig7: Bike# Count separated by Pass Type & Day of the Week
I was not expecting any trends in the bike numbers, but surprisingly there are some.
Except for the 7-day and Semester pass types, the bike numbers in the middle of the range seem to be the most used. This might be related to the stations the bikes are at, as some stations are more popular than others, as we will see below.
Fig8: Day of the Week Count
Fig9: Day of the Week Count separated by Pass Type
Friday is overall the most popular day of the week for ridership, followed by Thursday (surprising) and then Saturday. Monday, Tuesday and Wednesday usage is very close, whereas Sunday usage is markedly lower than on other days.
Looking at the data faceted by pass type, Annual pass holders mostly use their passes on weekdays (the distribution across the week is almost Gaussian-like). The opposite is true for 24-hour pass users, who prefer riding on weekends (as noted in the Pass Type section).
Maintenance rides seem to happen mostly on weekends. Semester pass holders mostly use their pass on weekdays, with Tuesday being the most popular.
For the 7-day pass there is no visible trend, but Thursday is the most popular day, followed by Friday and Saturday.
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##     -2.00      5.00     12.00     63.36     26.00 181600.00
Fig9: Box plots of Trip Duration
Fig9: Trip Duration Distribution
Fig10: Trip Duration distribution separated by Pass Type
Fig11: Trip Duration distribution separated by Day of the Week
Fig12: Trip Duration distribution separated by Pass Type and Day of the Week
This section also uses a subset of the data. Trip duration has a lot of outliers, so the data was limited to trips of up to 60 minutes, and that subset was used for the visualizations.
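A minimal sketch of that 60-minute subset is shown below. The `trips` data frame is a hypothetical toy example, and treating the limit as inclusive (`<= 60`) is an assumption; the text does not say whether negative durations were also dropped, so they are kept here.

```r
# Toy durations, including the extreme values seen in the summary output.
trips <- data.frame(Trip.Duration..Minutes. = c(-2, 3, 12, 60, 61, 181607))

# Keep trips of at most 60 minutes (assumed inclusive limit).
short_trips <- subset(trips, Trip.Duration..Minutes. <= 60)

# Share of trips retained by the filter.
kept_share <- nrow(short_trips) / nrow(trips)
```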
Overall, the trip duration distribution peaks at 4-5 minutes, then falls steeply and levels off from about the 32-minute mark, so it is right-skewed rather than strictly Gaussian.
Faceting by pass type, the most common trip duration is 12-13 minutes for the 24-hour pass, 10-13 minutes for the 7-day pass, 4-5 minutes for Annual, 1 minute for Maintenance (quick maintenance!) and 6-7 minutes for the Semester pass.
Looking at the plots by day of the week, trips of around 5 minutes are still the most common on weekdays but not on weekends, when 10-12 minute trips seem to be more popular. This may be because people are not in a hurry on weekends.
Combining pass type and day of the week, the Annual pass holders' trip duration pattern doesn't change much, and the earlier observations still hold. For the 24-hour pass, trip durations tend towards the upper range, at 10+ minutes. The 7-day pass shows no clear pattern in the plots. For Maintenance, 1 minute seems to be the most common turnaround time. For the Semester pass, 7-8 minute trips seem to be the most common.
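The "most common trip duration" per pass type quoted above is a mode, which can be computed rather than read off a plot. The sketch below is illustrative only: `trips` is hypothetical toy data and `modal_value` is a helper introduced here, since base R has no built-in function for the statistical mode.

```r
# Toy data: a few trips per pass type, durations in whole minutes.
trips <- data.frame(
  Entry.Pass.Type = c("Annual", "Annual", "Annual",
                      "24-hour", "24-hour", "24-hour",
                      "Maintenance", "Maintenance"),
  Trip.Duration..Minutes. = c(4, 4, 9, 12, 12, 30, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value (first one wins on ties).
modal_value <- function(x) as.numeric(names(which.max(table(x))))

# Modal duration for each pass type.
modes <- tapply(trips$Trip.Duration..Minutes., trips$Entry.Pass.Type, modal_value)
```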
Fig13: Checkout Station Count
Fig14: Checkout Station Count separated by Day of the Week
Fig15: Checkout Station Count separated by Pass Type
Fig16: Trip Duration distribution separated by Checkout Station
15th & Pearl and 13th and Spruce are the two most popular checkout stations in Boulder. There is a close tie between the 11th and Pearl and Municipal Building stations. Greenhouse and Gunbarrel North are the least used stations. The 14th and Walnut office entry might be an error, as this location has no latitude or longitude listed.
Faceting by day of the week, 15th & Pearl is still the most popular checkout station, with 13th and Spruce and Municipal Building the next most popular from Mon-Thu, and 11th & Pearl from Fri-Sun.
Analyzing the checkout stations by pass type, 15th & Pearl is still the most popular checkout station for all pass types except the Semester pass. For the 24-hour pass, 11th and Pearl is the second most popular checkout station, followed by 19th @ Broadway. The Village seems to be the second most popular station for the 7-day pass. The Annual pass distribution doesn't differ much from the overall pattern described at the start of this section, because Annual is the most popular pass type.
One thing to note is the spikes in maintenance at locations like The Village and 26th @ Pearl, which are not in line with the overall checkout station popularity pattern. This might indicate that the bikes there were subject to rougher use, or that a batch of bikes stationed there had some defect.
Faceting the trip duration by checkout station reveals no major surprises: the overall pattern across the popular stations still holds, with ride times in the 6-10 minute range.
Fig17: Return Station Count separated by Pass Type
Fig18: Return Station Count separated by Day of the Week
Fig19: Return Station Count separated by Pass Type
Fig20: Trip Duration distribution separated by Return Station
There are no surprises in the Return Station analysis; most if not all of the points from the previous section apply here as well.
Fig21: Checkout Date distribution
Fig22: Checkout Date distribution separated by Pass Type
The number of checkouts has increased from year to year. There is definitely a seasonal pattern in usage: the summer months (May-August) see an increase in checkouts, with a dip on either side. This makes sense, as people tend to ride less in the winter months. Among the summer months, July and August have the most checkouts across the years.
Viewing the plots by pass type, all pass types have seen an increase in usage since Boulder B-cycle was introduced. The 7-day pass saw a big increase in the summer of 2015, and the Semester pass has also seen a big increase since it was introduced in early 2014.
Among Annual pass holders, October 2015 had more usage than any of the warmer months. This is surprising; perhaps October was warm that year, or there were a lot of events in the Boulder area.
Maintenance generally follows the same trend, with more instances in the summer months and fewer in the colder months. One anomaly is that April 2015 had the most maintenance instances of that year, even though it wasn't the most popular month in terms of ridership. This might indicate that the Boulder B-cycle organization was preparing in advance for the busy summer months. This seems a reasonable guess, because maintenance was lower in the months following April 2015 across all pass types.
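Monthly trends like these depend on turning the `Checkout.Date` strings into dates first. The sketch below shows one way to do that in base R, assuming the month/day/year format visible in the sample rows; the `checkout_dates` vector is a hypothetical toy input.

```r
# Toy dates in the same m/d/Y format as the Checkout.Date column.
checkout_dates <- c("5/20/2011", "6/25/2015", "6/26/2015", "8/2/2015", "4/1/2015")

# Parse to Date objects, then label each checkout with its year and month.
parsed     <- as.Date(checkout_dates, format = "%m/%d/%Y")
year_month <- format(parsed, "%Y-%m")

# Number of checkouts per calendar month.
monthly_counts <- table(year_month)
```

Using a "YYYY-MM" label keeps the months in chronological order when the table is sorted or plotted.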
Fig23: Checkout Date distribution separated by Day of the Week
Fig24: Checkout Date distribution separated by Pass Type and Day of the Week
Analysing checkouts by day of the week, only Tuesday and Wednesday deviate from the general trend that August is the most popular month, followed by July. For Tuesday and Wednesday the roles are reversed.
A multivariate analysis shows finer trends in popular days across months and across pass types.
Fig25: Checkout Time Distribution
Fig26: Return Time Distribution
Fig27: Checkout Time distribution separated by Pass Type
Fig28: Return Time distribution separated by Pass Type
Checkouts and returns start slowly a little before 7:00 (early, so expected), with a big increase just after 7:00. From then on they decrease slightly, increase again from 10:00 to 11:15, dip at around 13:00, and then rise until 15:00. There is another dip, followed by an increase in ridership after 18:00.
The return and checkout times follow each other closely because the most common trip durations are under 10 minutes.
For 24-hour pass holders, checkouts and returns start off strong in the morning and slowly decrease, except for one spike at 10:00. They start increasing again at around 14:30, peak at 18:00, and then slowly decrease.
Among 7-day pass holders, 14:30 seems to be the peak for checkouts and returns. The climb towards that peak starts at around 11:30. This pattern also holds for Semester pass holders.
Annual pass usage follows the overall pattern described at the start of this section. For Maintenance, by contrast, the peak is in the morning before 11:00, followed by a big dip and then a big increase after 15:00.
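The time-of-day patterns above rely on converting the 12-hour `Checkout.Time` strings to a 24-hour scale. One possible sketch in base R follows; `checkout_times` is a hypothetical toy vector, and the `%p` (AM/PM) conversion is assumed to run in an English or C locale, since it is locale-dependent.

```r
# Toy times in the same "h:mm:ss AM/PM" format as the Checkout.Time column.
checkout_times <- c("9:24:00 AM", "12:16:00 PM", "5:05:00 PM", "12:04:00 AM")

# Parse as 12-hour clock times; the date part defaults to today and is ignored.
parsed <- strptime(checkout_times, format = "%I:%M:%S %p", tz = "UTC")

# Hour of day on a 24-hour scale, ready for counting or plotting.
hour_of_day  <- parsed$hour
hourly_count <- table(hour_of_day)
```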
Fig29: Checkout Time distribution separated by Day of the Week
Fig30: Return Time distribution separated by Day of the Week
Fig31: Checkout Time distribution separated by Pass Type & Day of the Week
Fig32: Return Time distribution separated by Pass Type & Day of the Week
Among 24-hour pass holders, the checkout/return pattern from Tue-Thu differs from Fri-Mon: Tue-Thu checkouts/returns don't dip as much in the middle of the day.
7-day pass checkouts/returns peak at around 15:00 from Mon-Thu, while Fri-Sun sees peaks and drops throughout the day. The same pattern applies to Semester pass holders.
Annual pass holders tend to use the service around 11:00, 15:00 and 18:00 on weekdays, whereas on Saturdays and Sundays usage peaks in the mornings and evenings.
Maintenance peaks at 15:00 on weekdays and in the mornings and evenings on Saturdays and Sundays (with a big drop in the middle of the day).
Fig33: Heat Map of Checkout Stations
Fig34: Heat Map of Return Stations
The size of each circle represents the overall checkouts/returns since B-cycle started. It is clear from the two maps that the downtown stations are the most used, followed by the stations near downtown and in or near the University.
if (FALSE)
{
  # caret provides createDataPartition(), trainControl(), train(),
  # resamples() and confusionMatrix(); the "lda", "rpart" and "nb" methods
  # additionally need the MASS, rpart and klaR packages installed
  library(caret)

  # Subset the data keeping the necessary variables:
  # Entry.Pass.Type (column 3) and Trip.Duration..Minutes. (column 13)
  mlsubset <- dataset[c(3, 13)]

  # Create an 80/20 train/test partition, stratified on the pass type
  set.seed(7)
  trainIndex <- createDataPartition(mlsubset$Entry.Pass.Type, p = 0.8, list = FALSE, times = 1)
  trainingset <- mlsubset[trainIndex, ]
  testset <- mlsubset[-trainIndex, ]

  # Split the data set for 10-fold cross validation, train on 9, test on 1 for all combinations
  # (named ctrl so that it does not mask caret's trainControl() function)
  ctrl <- trainControl(method = "cv", number = 10)
  metric <- "Accuracy"

  # Evaluate 3 different algorithms, make sure the same seed is used
  # Linear Discriminant Analysis
  set.seed(7)
  fit.lda <- train(Entry.Pass.Type ~ ., data = trainingset, method = "lda",
                   metric = metric, trControl = ctrl)
  # Classification and Regression Trees
  set.seed(7)
  fit.cart <- train(Entry.Pass.Type ~ ., data = trainingset, method = "rpart",
                    metric = metric, trControl = ctrl)
  # Naive Bayes
  set.seed(7)
  fit.nb <- train(Entry.Pass.Type ~ ., data = trainingset, method = "nb",
                  metric = metric, trControl = ctrl)

  # Summarize accuracy of models
  results <- resamples(list(lda = fit.lda, cart = fit.cart, nb = fit.nb))
  summary(results)
  # Dot plot of the results
  dotplot(results)

  # Compare the best model (CART) against the held-out test set
  predictions <- predict(fit.cart, testset)
  confusionMatrix(predictions, testset$Entry.Pass.Type)
}
This section explored whether Machine Learning algorithms could use the numeric data (trip duration) to predict the pass type. Three algorithms were tested: LDA, CART and Naive Bayes. Of the three, CART had the best results, but its accuracy was still low (below 70%). Because trip duration was the only numeric variable available, the algorithms did not perform well.
If Boulder B-cycle had provided another numeric variable, perhaps the distance covered during each trip, it would probably have improved the classification accuracy.
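To put the accuracy figure in context, it helps to compare it with a majority-class baseline, that is, always predicting the most common pass type. The sketch below computes that baseline from the pass-type counts in the summary output. It uses the full data set rather than the 80/20 split, so it is only an approximate reference point.

```r
# Pass-type counts copied from the summary() output of the full data set.
pass_counts <- c(`24-hour` = 83642, `7-day` = 5585, Annual = 113041,
                 Maintenance = 37337, `Semester (150-day)` = 8939)

# Accuracy obtained by always predicting the most frequent class (Annual).
baseline_accuracy <- max(pass_counts) / sum(pass_counts)
```

With these counts the baseline is roughly 45%, so any model accuracy should be read against that figure rather than against zero.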